Artificial intelligence performance in detecting lymphoma from medical imaging: a systematic review and meta-analysis

Background Accurate diagnosis and early treatment are essential in the fight against lymphatic cancer. The application of artificial intelligence (AI) in medical imaging shows great potential, but its diagnostic accuracy for lymphoma is unclear. This study is the first to systematically review and meta-analyse research on the diagnostic performance of AI in detecting lymphoma using medical imaging.
Methods Searches were conducted in Medline, Embase, IEEE and Cochrane up to December 2023. Data extraction and quality assessment of the included studies were conducted independently by two investigators. Studies that reported the diagnostic performance of one or more AI models for the early detection of lymphoma using medical imaging were included in the systematic review. We extracted binary diagnostic accuracy data to obtain the outcomes of interest: sensitivity (SE), specificity (SP), and area under the curve (AUC). The study was registered with PROSPERO (CRD42022383386).
Results Thirty studies were included in the systematic review, sixteen of which were meta-analyzed, giving a pooled sensitivity of 87% (95% CI 83–91%), specificity of 94% (92–96%), and AUC of 97% (95–98%). Satisfactory diagnostic performance was observed in subgroup analyses based on algorithm type (machine learning versus deep learning, and whether transfer learning was applied), sample size (≤ 200 or > 200), clinicians versus AI models, and geographical distribution of institutions (Asia versus non-Asia).
Conclusions Although performance may be overestimated and further studies with better standards for applying AI algorithms to lymphoma detection are needed, we suggest that AI may be useful in lymphoma diagnosis.
Supplementary Information The online version contains supplementary material available at 10.1186/s12911-023-02397-9.


Introduction
As a clonal malignancy of lymphocytes, lymphoma is diagnosed in 280,000 people annually worldwide, with divergent patterns of clinical behavior and responses to treatment [1]. Based on the WHO classification, non-Hodgkin lymphoma (NHL), derived from mature lymphoid cells, accounts for 6,991,329 (90.36%) disability-adjusted life-years (DALYs), and Hodgkin lymphoma (HL), originating from precursor lymphoid cells, accounts for 14.81% of DALYs [2,3]. About 30% of NHL cases arise in extranodal sites [4], and some subtypes are considered very aggressive (e.g., diffuse large B-cell lymphoma). Early and timely detection of lymphoma is needed to enable appropriate treatment and improve post-operative quality of life.
Because lymphocytes have diverse physiologic immune functions according to lineage and differentiation stage, the classification of lymphomas arising from these normal lymphoid populations is complicated. Imaging is a useful tool in medical science and is used in clinical practice to facilitate decision making for diagnosis, staging, and treatment [5]. Despite advances in medical imaging technology, it is difficult even for experienced hematopathologists to identify different subtypes of lymphoma. Diagnosis of lymphoma is first based on the pattern of growth and the cytologic features of the abnormal cells; clinical, molecular pathology, immunohistochemical, and genomic features are then required to finalize the identification of certain subtypes [6]. However, routine clinical methods that enable tissue-specific diagnosis, such as image-guided tumor biopsy and percutaneous needle aspiration, are subjective and costly and have poor classification accuracy [7]. Diagnostic features vary widely (from 14.8 to 27.3%) owing to inter-observer variability among experts using multiple imaging methods, such as computed tomography (CT), magnetic resonance imaging (MRI), and whole slide image (WSI), on the same sample [8]. As the diagnostic accuracy of lymphoma depends largely on the clinical judgment of physicians and the technical processing of tissue sections, resource-deprived areas with limited health system capacity and competing health priorities may lack the infrastructure, and perhaps the workforce, to ensure high-quality detection of lymphoma. Therefore, accurate, objective and cost-effective methods are required for the early diagnosis of lymphoma in clinical settings, ultimately to provide better guidance for lymphoma therapies.
Artificial intelligence (AI) offers tremendous opportunities in this field. It can extend the noninvasive study of oncologic tissue beyond established imaging metrics, assist automatic image classification, and improve the performance of cancer diagnosis [9][10][11]. As branches of AI, machine learning (ML) [12,13] and deep learning (DL) [8,14] have shown promising results for the detection of malignant lymphoma. However, no studies have systematically assessed the diagnostic performance of AI algorithms in identifying lymphoma. Here, we performed a meta-analysis to assess the diagnostic accuracy of AI algorithms that use medical imaging to detect lymphoma.

Materials and methods
The study protocol was registered with PROSPERO (CRD42022383386). This meta-analysis was conducted according to the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 guidelines [15]. Ethical approval was not applicable.

Search strategy and eligibility criteria
In this study, we searched Medline, Embase, IEEE and the Cochrane Library up to December 2023. No restrictions were applied regarding region, language, participant characteristics, imaging modality, AI model or publication type. The full search strategy was developed in collaboration with a group of experienced clinicians and medical researchers (see Additional file 1).
Eligibility assessment was conducted independently by two investigators, who screened titles and abstracts and selected all relevant citations for full-text review. Disagreements were resolved through discussion with another collaborator. We included all published studies that reported the diagnostic performance of one or more AI models for the early detection of lymphoma using medical imaging. Studies that met the following criteria were included in the final group: (1) any study that analyzed medical imaging for the diagnosis of lymphoma with AI-based models; (2) studies that provided any raw diagnostic performance data, such as sensitivity, specificity, area under the curve (AUC), accuracy, negative predictive values (NPVs), or positive predictive values (PPVs). The primary outcomes were diagnostic performance indicators. Studies were excluded when they met any of the following criteria: (1) case reports, review articles, editorials, letters, comments, and conference abstracts; (2) studies that used medical waveform data graphics material (e.g., electroencephalography, electrocardiography, and visual field data) or investigated the accuracy of image segmentation rather than disease classification; (3) studies without a disease classification outcome or not addressing the target diseases; (4) studies that did not use histopathology and expert consensus as the reference standard for lymphoma diagnosis; (5) studies that used animal or non-human samples; (6) duplicate studies.

Data extraction
Two investigators independently extracted study characteristics and diagnostic performance data using a predetermined data extraction sheet. Again, uncertainties were resolved by a third investigator. Where possible, we extracted binary diagnostic accuracy data and constructed contingency tables at the reported thresholds. Contingency tables contained true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) values and were used to determine sensitivity and specificity. If a study provided multiple contingency tables for the same or for different AI algorithms, we assumed that they were independent of each other.
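As a minimal sketch of how sensitivity and specificity follow from a contingency table, the Python snippet below uses a hypothetical `ContingencyTable` class; the class name and the example counts are illustrative assumptions and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """2x2 diagnostic table (hypothetical helper, for illustration only)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def sensitivity(self) -> float:
        # Proportion of diseased cases correctly identified: TP / (TP + FN).
        diseased = self.tp + self.fn
        if diseased == 0:
            raise ValueError("no diseased cases; sensitivity is undefined")
        return self.tp / diseased

    def specificity(self) -> float:
        # Proportion of non-diseased cases correctly identified: TN / (TN + FP).
        healthy = self.tn + self.fp
        if healthy == 0:
            raise ValueError("no non-diseased cases; specificity is undefined")
        return self.tn / healthy


# Illustrative counts only (assumed, not extracted from the review).
table = ContingencyTable(tp=87, fp=6, tn=94, fn=13)
print(f"SE = {table.sensitivity():.2f}, SP = {table.specificity():.2f}")
```

Each reported threshold would yield one such table, which is why a single study can contribute several tables to the analysis.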

Quality assessment
The Quality Assessment of Diagnostic Accuracy Studies-AI (QUADAS-AI) criteria, an AI-specific extension to QUADAS-2 [17] and QUADAS-C [18], were used to assess the risk of bias and applicability concerns of the included studies [16].

Meta-analysis
Hierarchical summary receiver operating characteristic (SROC) curves were used to assess the diagnostic performance of AI algorithms. The hierarchical SROC model lends more credibility to analyses of small samples because it takes both between-study and within-study variation into account. 95% confidence intervals (CI) and prediction regions were generated around the averaged sensitivity, specificity, and AUC estimates in the hierarchical SROC figures. Heterogeneity was assessed using the I² statistic. We performed subgroup and regression analyses to explore the potential effects of sample size (≤ 200 or > 200), diagnostic performance using the same dataset (AI algorithms or human clinicians), AI algorithm type (ML or DL), geographical distribution (Asia or non-Asia), and application of transfer learning (yes or no). A random-effects model was used because differences between studies were assumed. The risk of publication bias was assessed using a funnel plot.
We evaluated the quality of the included studies using RevMan (Version 5.3). A cross-hairs plot was produced (R v4.2.1) to better display the variability between sensitivity/specificity estimates. All other statistical analyses were conducted using Stata (Version 16.0). A two-sided p < 0.05 was the threshold for statistical significance.
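To illustrate the idea behind the I² statistic and random-effects pooling, the sketch below applies the univariate DerSimonian-Laird estimator to logit-transformed sensitivities. This is a simplified stand-in, not the hierarchical SROC model fitted in Stata for this review; the function names, the 0.5 continuity correction, and the example counts are all assumptions made for illustration.

```python
import math


def logit_and_variance(events, total):
    """Logit of a proportion and its approximate variance (0.5 continuity correction)."""
    a = events + 0.5
    b = total - events + 0.5
    return math.log(a / b), 1.0 / a + 1.0 / b


def dersimonian_laird(estimates, variances):
    """Return (pooled estimate, standard error, tau^2, I^2 in percent)."""
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # share of variation beyond chance
    w_star = [1.0 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2, i2


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical (true positives, diseased cases) per study; not data from the review.
studies = [(45, 50), (80, 100), (28, 40)]
ests, variances = zip(*(logit_and_variance(e, n) for e, n in studies))
pooled, se, tau2, i2 = dersimonian_laird(list(ests), list(variances))
low, high = inv_logit(pooled - 1.96 * se), inv_logit(pooled + 1.96 * se)
print(f"pooled SE = {inv_logit(pooled):.3f} (95% CI {low:.3f}-{high:.3f}), I2 = {i2:.1f}%")
```

Pooling on the logit scale keeps the back-transformed estimate and its interval within 0 and 1; a bivariate or hierarchical model additionally accounts for the correlation between sensitivity and specificity, which this sketch ignores.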

Study selection and characteristics
Our search initially identified 1155 records, of which 1110 were screened after removing 45 duplicates. Of these, 1010 were excluded because they did not fulfill our predetermined inclusion criteria. A total of 100 full-text articles were reviewed, 70 were excluded, and the remaining 30 focused on lymphomas (see Fig. 1) [1,8,12-14]. Study characteristics are summarized in Tables 1, 2 and 3.
Twenty-nine studies utilized retrospective data. Only one study used prospective data. Six studies used data from open-access sources. Five studies excluded low-quality images, while ten studies did not report anything about image quality. Six studies performed external validation using an out-of-sample dataset, fifteen studies did not report the type of internal validation, and the others performed internal validation using an in-sample dataset. Seven studies utilized ML algorithms and twenty-three studies used DL algorithms to detect lymphoma. Three studies compared AI algorithms against human clinicians using the same dataset. Among the studies analyzed, six utilized samples diagnosed with PCNSL, six involved samples with DLBCL, four focused on ALL, and two focused on NHL. Additionally, individual studies were conducted among patients with ENKTL, splenic and gastric marginal zone lymphomas, and ocular adnexal lymphoma. Furthermore, a variety of medical imaging modalities were employed across the studies: six studies utilized MRI, four used WSI instruments, four employed microscopic blood images, three utilized PET/CT, and two relied on histopathology images.

Heterogeneity analysis
All included studies found that AI algorithms were useful for the detection of lymphoma using medical imaging when compared with reference standards; however, extreme heterogeneity was observed. Sensitivity (SE) had an I² of 99.35%, while specificity (SP) had an I² of 99.68% (p < 0.0001); see Fig. 3. The detailed results of the subgroup and meta-regression analyses are shown in Table 4. Heterogeneity in the pooled specificity and sensitivity remained significant within each subgroup, suggesting potential sources of inter-study heterogeneity among studies with different sample sizes, algorithms, geographical distribution, and AI algorithm-assisted clinicians versus pure clinicians. However, the meta-regression results show that only the difference between AI algorithms and human clinicians remained statistically significant, indicating a potential source of between-subgroup heterogeneity. Furthermore, a funnel plot was produced to assess publication bias (see Fig. 4). The p value of 0.49 suggests that there is no publication bias, although studies were widely dispersed around the regression line.

Quality assessment
The quality of the included studies is summarized in Fig. 5 using the QUADAS-AI tool. A detailed assessment of each item, based on the domains of risk of bias and applicability concerns, is provided in Fig. 6. For the subject selection domain of risk of bias, fourteen studies were considered at high or unclear risk of bias because they did not report the rationale and breakdown of training/validation/test sets, derived data from open-source datasets, or did not perform image pre-processing. For the index test domain, seventeen studies were considered at high or unclear risk of bias because they did not perform external verification, whereas the others were considered at low risk of bias. For the reference standard domain, ten studies were considered at unclear risk of bias owing to incorrect classification of the target condition.

Discussion
To our knowledge, this is the first systematic review and meta-analysis of the diagnostic accuracy of AI in lymphoma using medical imaging. After careful selection of studies with full reporting of diagnostic performance, we found that AI algorithms could be used for the detection of lymphoma using medical imaging, with an SE of 87% and an SP of 94%. We adhered strictly to the guidelines for diagnostic reviews and conducted a comprehensive literature search in both medical databases and engineering and technology databases to ensure the rigor of the study. More importantly, we assessed study quality using an adapted QUADAS-AI assessment tool, which provides researchers with a specific framework to evaluate the risk of bias and applicability of AI-centered diagnostic test accuracy studies. Although our results were largely consistent with previous research, confirming the worries that premier journals have recently raised [5,44-46], none of the previous studies focused specifically on lymphoma. To fill this research gap, we sought to identify the best available AI algorithm, which could then be developed to enhance the detection of lymphoma and to reduce the number of false positives and false negatives beyond what is humanly possible. Our findings revealed that AI algorithms exhibit commendable performance in detecting lymphoma.
Our pooled results demonstrated an AUC of 97%, aligning closely with the performance of established conventional diagnostic methods for lymphoma. Notably, this performance was comparable to that of emerging radiation-free imaging techniques, such as whole-body magnetic resonance imaging (WB-MRI), which yielded an AUC of 96% (95% CI, 91-100%), and the current reference standard, 18F-fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT), with an AUC of 87% (95% CI, 72-97%) [47]. Additionally, the SE and SP of AI algorithms surpassed those of basic CT, with SE = 81% and SP = 41% [48]. However, the comparison between AI models and existing modalities was inconsistent across studies, potentially because of the diverse spectrum of lymphoma subtypes, variations in modality protocols and image interpretation methods, and differences in reference standards [49].
Similar to previous research in the field of image-based AI diagnostics for cancers [5,50,51], we observed statistically significant heterogeneity among the included studies, which makes it difficult to generalize our results to larger samples or to other countries. Therefore, we conducted rigorous subgroup analyses and meta-regression for different sample sizes, algorithms, geographical distribution, and AI algorithm-assisted clinicians versus pure clinicians. Contrary to earlier findings [52], our results showed that studies with smaller sample sizes and those conducted in Asian regions had higher SE than other studies.
Fig. 2 Hierarchical SROC curves for studies included in the meta-analysis (16 studies with 124 tables)
Significant between-study heterogeneity emerged within the comparison of AI-assisted clinicians and pure clinicians. Other sources of heterogeneity, however, could not be explained by the results, potentially because of the broad nature of our review and the relatively limited number of studies included. Unlike ML, DL is a young subfield of AI based on artificial neural networks, which can automatically extract characteristic features from images [53]. Moreover, it offers significant advantages over traditional ML methods in the early detection and diagnosis of lymphoma, including higher diagnostic accuracy [8,14], more efficient image analysis [13], and a greater ability to handle complex morphologic patterns in lymphoma accurately [1]. Most included studies in this review investigating the use of AI in lymphoma detection employed DL (n = 18), with only six studies using ML. For leukemia diagnosis, the convolutional neural networks (CNNs) of DL have been used, e.g., to distinguish between cases with favourable and poor prognosis of chronic leukemia [54], or to recognize blast cells in acute myeloid leukemia [55]. However, DL requires far more data and computational power than ML methods and is more prone to overfitting. Some included studies used data augmentation methods, adopting affine image transformation strategies such as rotation, translation, and flipping to make up for data deficiencies [13,26]. The pooled SE of studies using ML methods was higher than that of studies using DL methods (93% vs 86%), while SP was similar between the two methods (92% vs 94%). We also found that AI models using transfer learning had greater SE (88% vs 85%) and SP (95% vs 91%) than models that did not. Transfer learning refers to the reuse of a pre-trained model on a new task: a machine exploits the knowledge gained from a previous task to improve generalization on another. Accordingly, various studies have highlighted the advantages of transfer learning over traditional AI algorithms, including accelerated learning speed, reduced data requirements, enhanced diagnostic accuracy, optimal resource utilization, and improved performance in early detection and diagnostic precision of lymphoma [13,56]. McAvoy et al. [20] also reported that transfer learning implemented with a high-performing CNN architecture was able to classify GBM and PCNSL with high accuracy (91-92%). Within this review, no significant differences were observed between studies employing transfer learning and those that did not, or between studies using ML and DL models, potentially indicating limitations stemming from the restricted size of the datasets examined in these studies.
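The affine augmentation strategies mentioned above (rotation, translation, and flipping) can be sketched in a few lines of plain Python. This is only an illustrative toy operating on nested lists; the function names are hypothetical, and the included studies would have used their own imaging pipelines, whose details are not reported here.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image as rows of pixel values


def rotate90(img: Image) -> Image:
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def translate(img: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Shift the image by (dx, dy) pixels, padding uncovered pixels with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def augment(img: Image) -> List[Image]:
    """Return the original image plus three transformed copies."""
    return [img, rotate90(img), flip_horizontal(img), translate(img, 1, 0)]


sample = [[1, 2], [3, 4]]
for variant in augment(sample):
    print(variant)
```

Each transformed copy keeps the original label, so a small dataset yields several training examples per image; whether this reduces overfitting in practice depends on the model and data.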
Evidence also suggested that AI algorithms had superior SE (91%) and SP (96%), performing better than independent detection by human clinicians (70% and 86%). Moreover, these differences were the major source of heterogeneity in the meta-regression analysis. Although AI offers certain advantages over physician diagnosis, such as faster image processing and the ability to work continuously, it does not take into account all the information that physicians rely on when evaluating a complicated examination. Of the included studies, only three compared the performance of integrating AI with clinicians and pure algorithms, which also restricts our ability to extrapolate the diagnostic benefit of these algorithms in medical care delivery. In the future, the AI-versus-physician dichotomy will no longer be advantageous, and an AI-physician combination would drive developments in this field and largely reduce the burden on the healthcare system. On the one hand, future non-trivial applications of AI in medical settings may need physicians to combine demographic information with image data, optimize the integration of clinical workflow patterns, and establish cloud-sharing platforms to increase the availability of annotated datasets. On the other hand, AI could perhaps serve as a cost-effective replacement diagnostic tool or as an initial method of risk categorization to improve the workflow efficiency and diagnostic accuracy of physicians.
Although our review suggests a promising future for AI based on the current literature, some critical methodological issues mean that the findings need to be interpreted with caution. First, only one prospective study was identified, and it did not provide a contingency table for meta-analysis. In addition, twelve studies used data from open-access databases or non-target medical records, and only eleven were conducted in real clinical environments (e.g., hospitals and medical centers). It is well known that prospective studies provide more favorable evidence [57], and retrospective studies with in silico data sources might not include applicable population characteristics or appropriate proportions of minority groups. Additionally, the ground truth labels in open-access databases were mostly derived from data collected for other purposes, and the criteria for the presence or absence of disease were often poorly defined [58]. The reporting of how missing information in these datasets was handled was also poor across all studies. Therefore, the developed models might lack generalizability, and studies using these databases may be considered proof-of-concept studies of technical feasibility rather than real-world experiments evaluating the clinical utility of AI algorithms.
Second, only six studies in this review performed external validation. For internal validation, three studies adopted random splitting, and twelve used cross-validation. Performance judged on in-sample homogeneous datasets may lead to uncertainty around the estimates of diagnostic performance; it is therefore vital to validate performance using data from a different organization to increase the generalizability of the model. Additionally, only five studies excluded poor-quality images, and none were quality controlled for the ground truth labels. This may render the AI algorithms vulnerable to mistakes and unidentified biases [59].
Third, although no publication bias was observed in this review, we must admit that researcher-based reporting bias could also lead to overestimation of the accuracy of AI. Some related methodological guides have recently been published [60][61][62], but disease-specific AI guidelines are not yet available. Since researchers tend to selectively report favorable results, this bias is likely to skew the dataset and add complexity to the overall appraisal of AI algorithms in lymphoma and their comparison with clinicians.
Fourth, the majority of the included studies were performed in the absence of AI-specific quality assessment criteria. Ten studies were considered to have low risk in more than three evaluation domains, while nine studies were considered high risk under the AI-specific risk of bias tool. Previous studies most commonly used the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, encouraged by current PRISMA 2020 guidance, to assess bias and applicability [63]; this tool does not address the particular terminology that arises in AI diagnostic test studies. Furthermore, it does not take into account other challenges that arise in AI research, such as algorithm validation and data pre-processing. QUADAS-AI provided us with specific instructions to evaluate these aspects [16], which is a strength of our systematic review and will help guide future relevant studies. However, QUADAS-AI still faces several challenges [16,64], including incomplete uptake, lack of a formal quality assessment tool, unclear methodological interpretation (e.g., validation types and comparison to human performance), unstandardized nomenclature (e.g., inconsistent definitions of terms like validation), heterogeneity of outcome measures, scoring difficulties (e.g., uninterpretable/intermediate test results), and applicability issues. Since most of the relevant studies were designed or conducted before this guideline was available, we accepted the low quality of some of the studies and the heterogeneity between the included studies.
This meta-analysis has some limitations that merit consideration. First, a relatively small number of studies were available for inclusion, which could have skewed the diagnostic performance estimates. Additionally, the restricted number of studies addressing diagnostic accuracy in each subgroup, such as specific lymphoma subtypes and medical imaging modalities, prevented a comprehensive assessment of potential sources of heterogeneity [65,66]. Consequently, the generalizability of our conclusions to diverse lymphoma subtypes and varied medical imaging modalities, particularly without the integration of AI models at this current stage, could be limited. Second, we did not conduct a quality assessment for transparency, since the current diagnostic accuracy reporting standards (STARD-2015) [67] are not fully applicable to the specifics and nuances of AI research. Third, several included studies have methodological deficiencies or are poorly reported, so their results may need to be interpreted with caution. Furthermore, the wide range of imaging technologies, patient populations, pathologies, study designs and AI models used may have affected the estimation of the diagnostic accuracy of AI algorithms. Finally, this study only evaluated studies reporting the diagnostic performance of AI using medical images, so its findings are difficult to extend to the impact of AI on patient treatment and outcomes.
To further improve the performance of AI algorithms in detecting lymphoma, based on the aforementioned analysis, focused efforts are required in the domains of robust design and high-quality reporting. Specifically, first, emphasis should be placed on fostering more multi-center prospective studies and expansive open-access databases. Such endeavors can facilitate the exploration of various ethnicities, hospital-specific variables, and other nuanced population distributions to establish the reproducibility and clinical relevance of the AI model. Therefore, we suggest the establishment of interconnected networks between medical institutions, fostering unified standards for data acquisition, labeling procedures and imaging protocols to enable external validation in professional environments. Additionally, we call for prospective registration of diagnostic accuracy studies, integrating an a priori analysis plan, which would help improve the transparency and objectivity of study reporting. Second, we would encourage AI researchers in medical imaging to report studies that do not reject the null hypothesis, which might improve both the impartiality and clarity of studies that intend to evaluate the clinical performance of AI algorithms in the future. Finally, though time-consuming and difficult [68], the development of "customized" AI models tailored to specific domains, such as lymphoma, head and neck cancer [69], or brain MRI [70], emerges as a pertinent suggestion. This tailored approach, encompassing meticulous preparations such as feature engineering and AI architecture design, alongside intricate calculation procedures like segmentation and transfer learning, could yield substantial benefits for both patients and healthcare systems in clinical application.

Conclusions
This systematic review and meta-analysis appraised the quality of the current literature and concluded that AI techniques may be used for lymphoma diagnosis using medical images. However, it should be acknowledged that these findings rest on studies with poor design, methods and reporting. More high-quality studies on the application of AI to lymphoma diagnosis, adapted to clinical practice and following standardized research routines, are needed.

Fig. 1 PRISMA flow chart outlining the selection of studies for review

Fig. 3 Cross-hair plot of studies included in the meta-analysis (16 studies with 124 tables)

Table 1
Participant demographics for the 24 included studies

First author and year Participants Inclusion criteria Exclusion criteria N Mean age (SD; range, year)
Patients who had undergone surgical resection, radiotherapy, chemotherapy, and/or bone marrow transplantation as well as those with other malignan-Patient age < 18 years; missing clinical information; receipt of hormone therapy before undergoing MRI; no data on enhanced MRI; lesions not in the cerebral parenchyma; and MR images with obvious artifact.

Table 2
Model training and validation for the 24 included studies
NR not reported, MCL mantle cell lymphoma, PCNSL primary central nervous system lymphoma, DLBCL diffuse large B-cell lymphoma, HGL high grade lymphomas, FL follicular lymphoma, BL Burkitt lymphoma, SLL small lymphocytic lymphoma, ENKTL nasal-type extranodal natural killer/T cell lymphoma, NHL non-Hodgkin's lymphoma, ALL acute lymphoblastic leukemia, OAL ocular adnexal lymphoma, RL reactive lymphoid hyperplasia

Table 3
Indicator, algorithm, and data source for the included studies

First author and year Indicator definition Algorithm Data source Method for predictor Exclusion of poor quality imaging Heatmap provided Algorithm architecture Transfer learning applied Source of data Number of images for training/internal/external Data range
NR not reported, MRI magnetic resonance imaging, mpMRI multi-parametric MRI, WSI whole slide image, CNN convolutional neural network, DCNN deep convolutional neural network, IF-CNN image-level fusion based multi-parametric CNN, CART classification and regression tree, DLCNN deep learning convolutional neural network, ResNet residual network, BNN bayesian neural network, LASSO least absolute shrinkage and selection operator, ANN artificial neural network, DSCNet deep skip connections-based dense network, MIL multiple instance learning algorithms

Table 4
Summary estimate of pooled performance of artificial intelligence in lymphoma detection
a. P-value for heterogeneity within each subgroup
b. P-value for heterogeneity between subgroups with meta-regression analysis